This report was prepared for ETC5521 Assignment 1 by Team Grevillea, comprising Dewi Lestari Amaliah, Yiwen Jiang, (Samuel Lyubic), and (Brendi Ang).
The Tour de France is a cycling tournament held annually in France, spanning 21 stages over 23 gruelling days. It is considered one of the most prestigious races that elite cyclists can take part in, given the difficulty of riding through the varied French landscape (EEB (2020)). Although the overall title goes to an individual, riders travel through the country in teams of eight in order to place the team leader in the best position to win the general classification. This report mainly analyses the changes in the Tour de France over the past hundred years and the changes in riders’ characteristics. The analysis was carried out in R using RStudio.
The Tour de France (“Le Tour”) is an annual men’s bicycle race that in modern days consists of a 21-stage course covering approximately 3,500 km. It is predominantly held in France, while often passing through other countries (Encyclopedia Britannica (2020)). The teams pass through long countryside roads, steep alpine regions and tight city areas, covering terrain that ranges from rolling hills and long flat roads to steep mountains. These types of terrain are dispersed across the different stages over the 23 days. The rider with the lowest cumulative time across all stages wins the overall title.
Because the race is so long, the intelligence and athletic ability of the riders are not the only factors in winning it; the choice of strategy is also key. In addition, as technology has changed and living standards have improved over the past hundred years, the Tour de France itself has changed in various ways. We present our findings on the Tour de France by exploring the recorded data from 1903 to 2019.
The original source of the Tour de France data is the tdf package written by Rushworth (2020). The data were then made available for download in the TidyTuesday GitHub repository by Mock (2020). They contain competition and winner information for the 106 editions held between 1903 and 2019, and were prepared for analysing how the competition has changed over more than a hundred years. They include information on the 106 winners and the stage-by-stage results of the riders in each of the 106 editions. Three data sets on the Tour de France are available in this repository, namely tdf_winners.csv, stage_data.csv, and tdf_stages.csv. The tdf_stages.csv data only cover the period up to 2017, so we obtained the 2018 and 2019 data from Wikipedia (Wikipedia contributors 2020) and (Wikipedia contributors 2020).
Figure 1.1: Missing values in the winners data
The missing values in full_name, died, and nickname do not matter, since we do not use these variables. However, we do use the height, weight, time_overall, and time_margin variables, so their missing values have to be handled. We decided to omit these missing values because the variables are specific to each person. It is hard to impute them with, for example, an average, because the winner’s overall time differs every year and even a slight gap in time can affect the result of the competition.
The tdf_stages data inside the tdf package only cover editions up to 2017, while the other two data sets run all the way to 2019. Therefore, when using this data set, we must decide whether to discard the years after 2017 or to complete the stages data. If the analysis is for prediction, recent years are relatively important and it is unreasonable to discard them. However, if we choose to complete the data after 2017, we need to ensure that the data remain consistent, because, for example, the recording method may vary between institutions.
The data set records competition information for the Tour de France published on Alastair Rushworth’s GitHub. It covers the race since it was first organised in 1903; 106 editions have been held since then.
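The handling of missing values described above can be sketched as follows. This is a minimal illustration on a made-up stand-in for tdf_winners.csv (the `toy_winners` values are hypothetical), and it assumes that rows are dropped whenever any of the four analysis variables is missing; the report does not state whether the omission was done jointly or per analysis.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for tdf_winners.csv; all values are made up for illustration.
toy_winners <- tibble(
  edition      = 1:4,
  winner_name  = c("Rider A", "Rider B", "Rider C", "Rider D"),
  nickname     = c(NA, "The Eagle", NA, NA),
  height       = c(1.78, NA, 1.80, 1.75),
  weight       = c(70, 68, NA, 66),
  time_overall = c(94.5, 96.1, 92.3, NA),
  time_margin  = c(0.05, 0.10, 0.02, NA)
)

# Count the missing values in each column (a numeric counterpart to Figure 1.1).
missing_per_column <- toy_winners %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Drop rows only where the variables used in the analysis are missing;
# gaps in unused columns such as nickname are left alone.
winners_complete <- toy_winners %>%
  drop_na(height, weight, time_overall, time_margin)
```

Restricting `drop_na()` to the four analysis variables keeps winners whose only gaps are in columns the analysis never touches.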
The Tour de France is a men’s multi-stage cycling race held annually in France and nearby countries. It was established in 1903 to increase the sales of the newspaper L’Auto. This event has become an important cultural event for European fans.
This data set includes information on the 106 winners, the stages of each edition, and the riders’ results for each stage. The records run from 1903 to 2019. Alastair Rushworth collected the information from various websites and other sources and integrated it into the data set. The data are split into three files in .csv format. The variables in each file are described below.
The tdf_winners data come from tdf_winners.csv. They contain information about the 106 winners of the Tour de France from 1903 to 2019. A subset of the variables is shown in Table 2.1.

| Variable | Class | Description |
|---|---|---|
| edition | integer | Edition of the Tour de France |
| start_date | double | Start date of the Tour |
| winner_name | character | Winner’s name |
| winner_team | character | Winner’s team (NA if not on a team) |
| distance | double | Distance traveled in KM across the entire race |
| time_overall | double | Time in hours taken by the winner to complete the race |
| time_margin | double | Difference in finishing time between the race winner and the runner up |
| height | double | Height in meters |
| weight | double | Weight in kg |
| age | integer | Age as winner |
| nationality | character | Nationality |

The stage_data data come from stage_data.csv. They contain ranking information for each stage of the annual race. The variables are shown in Table 2.2.

| Variable | Class | Description |
|---|---|---|
| edition | integer | Race edition |
| year | double | Year of race |
| stage_results_id | character | Stage ID |
| rank | character | Rank of racer for stage |
| time | double | Time of racer |
| rider | character | Rider name |
| age | integer | Age of racer |
| team | character | Team (NA if not on team) |
| points | integer | Points for the stage |
| elapsed | double | Time elapsed stored as lubridate::period |
| bib_number | integer | Bib number |

The tdf_stages data come from tdf_stages.csv. They contain information on each stage of the annual race. The variables are shown in Table 2.3.

| Variable | Class | Description |
|---|---|---|
| Stage | character | Stage Number |
| Date | double | Date of stage |
| Distance | double | Distance in KM |
| Origin | character | Origin city |
| Destination | character | Destination city |
| Type | character | Stage Type |
| Winner | character | Winner of the stage |
| Winner_Country | character | Winner’s nationality |

The data set is primarily used to analyse how the Tour de France has changed over more than a hundred years. The primary question to answer is: how have the riders performed? After an overview of the primary question, we conduct a more specific analysis through the secondary questions addressed later in this report.
The data come from Alastair Rushworth’s tdf data package. The package contains information about the overall winning rider for each edition of the race, the winner’s biographical information, and the results for each stage in each edition. To install the package, use install_github("alastairrushworth/tdf").
The winner data can be imported from editions in the tdf package; we only need to drop the stage_results variable.
The stage_data data set is also imported from tdf::editions, where the stage information is nested in the stage_results variable. We use the unnest_longer() function to rectangle the nested stage data into a tidy tibble, and then use the flatten_df() function to flatten the list of lists into a single data frame. This step is essential because the stage data are nested inside the editions data and cannot be read from it directly. Finally, we select the relevant variables and use the year() function from the lubridate package to extract the year of each race.
The tdf_stages data also originate from the tdf package, but the package only provides them up to 2017, while the rest of the data set runs to 2019. Thus, we opted to use the cleaning script found on the TidyTuesday GitHub page (Mock 2020). Our debugging process included renaming inconsistent variable names, extracting components of date-time objects, and using regular expressions to fix the structure of character strings, so as to reproduce the intended data set. In addition, we scraped Wikipedia (Wikipedia contributors (2020) and Wikipedia contributors (2020)) to obtain the stages data for 2018 and 2019. These Wikipedia pages correspond to the source of the original data set, so binding the data was straightforward because their structure was analogous to that of tdf_stages.
Part of the data cleaning code is shown below.
```r
library(tidyverse)
library(tdf) # install at: https://github.com/alastairrushworth/tdf

# Winner-level data: everything in editions except the nested stage results
winners <- tdf::editions %>%
  select(-stage_results)

# Unnest the stage results; rank is coerced to character so that the
# per-stage tables share a common column type
all_years <- tdf::editions %>%
  unnest_longer(stage_results) %>%
  mutate(stage_results = map(stage_results, ~ mutate(.x, rank = as.character(rank)))) %>%
  unnest_longer(stage_results)

# Flatten the remaining nested column into a data frame
stage_all <- all_years %>%
  select(stage_results) %>%
  flatten_df()

combo_df <- bind_cols(all_years, stage_all) %>%
  select(-stage_results)

# Keep the stage-level columns and derive the race year
stage_clean <- combo_df %>%
  select(edition, start_date, stage_results_id:last_col()) %>%
  mutate(year = lubridate::year(start_date)) %>%
  rename(age = age...25) %>%
  select(edition, year, everything(), -start_date)

# Write both tables to disk
winners %>%
  write_csv(here::here("2020", "2020-04-07", "tdf_winners.csv"))

stage_clean %>%
  write_csv(here::here("2020", "2020-04-07", "stage_data.csv"))
```
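To make the rectangling step easier to follow, here is a minimal sketch on a toy nested tibble. The object `toy_editions` and its contents are hypothetical and only mimic the shape described above (one row per edition, with a named list of per-stage tables). The sketch also swaps the flatten_df() and bind_cols() steps for a single tidyr::unnest() call, so it illustrates the idea rather than reproducing the script.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of tdf::editions: each edition holds a named list
# of per-stage result tables.
toy_editions <- tibble(
  edition = c(1L, 2L),
  stage_results = list(
    list(
      `stage-1` = tibble(rank = c("1", "2"), rider = c("Rider A", "Rider B")),
      `stage-2` = tibble(rank = c("1", "2"), rider = c("Rider B", "Rider A"))
    ),
    list(
      `stage-1` = tibble(rank = c("1", "DNF"), rider = c("Rider C", "Rider D"))
    )
  )
)

toy_stage_data <- toy_editions %>%
  # One row per stage; the list names are kept in stage_results_id.
  unnest_longer(stage_results) %>%
  # One row per rider within each stage.
  unnest(stage_results)
```

Each row of `toy_stage_data` is then one rider in one stage, which is the shape of stage_data.csv.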
To get an overview of the Tour de France, we first look at the performance of the winners and of the other riders. We use the tdf_winners data to visualise how many times each rider has won the competition, and then use the stage_clean data to see how riders perform in each stage.
Figure 3.1: Number of Tour de France wins per rider
Figure 3.1 ranks riders by the number of times they have won the Tour de France. Lance Armstrong has won the competition seven times, more than any other rider in the data. This brings us to an interesting question: does the winner achieve a high rank or an extraordinary performance in most of the stages?
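The ranking behind Figure 3.1 amounts to counting rows per winner. A minimal sketch, using a made-up `toy_winners` table in place of tdf_winners.csv (the `winner_name` column name follows Table 2.1, the values are hypothetical):

```r
library(dplyr)

# Toy stand-in for the winners table; the names are placeholders.
toy_winners <- tibble(
  edition     = 1:6,
  winner_name = c("Rider A", "Rider B", "Rider A", "Rider C", "Rider A", "Rider B")
)

# Number of wins per rider, most successful first.
wins_per_rider <- toy_winners %>%
  count(winner_name, name = "wins", sort = TRUE)
```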
Figure 3.2: Cumulative points of the top 15 riders in the 2019 Tour de France
The animation (Figure 3.2) displays the changes in riders’ cumulative points in 2019 (the higher the rank, the more points). It only shows the top 15 riders by accumulated points. The red bar is the 2019 winner, Egan Bernal. It is clear that the winner does not need to achieve outstanding results in every stage: the winner’s cumulative points are only about half of those of the riders with the highest cumulative points.
These figures guide a more in-depth exploration of the data. What are the characteristics of Tour de France winners? How have the distance and speed of the Tour de France changed? Is the Tour de France becoming more competitive?
In sport, an athlete’s physical fitness is a core factor affecting the outcome. This is especially true for races like the Tour de France, where the 23-day race is a severe test of physical fitness.
BMI is a commonly used indicator of a person’s physical health. For example, a low BMI may be associated with a weakened immune system or weakened bones. We combine the height and weight variables of the winners into BMI according to the following formula:

$$\text{BMI} = \frac{\text{weight (kg)}}{\text{height (m)}^2}$$
Figure 3.3: BMI of previous Tour de France winners
Figure 3.3 shows the BMI of the Tour de France winner each year. The winners’ BMI is mostly concentrated in a particular range. The red dashed lines mark the appropriate BMI range for adult males, which is between 19 and 25. In recent years the winners’ BMI shows a slight downward trend, but the trend is not pronounced.
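As a small worked example of the BMI calculation, the sketch below divides weight in kilograms by the square of height in metres (the units given in Table 2.1) and flags whether each value falls inside the 19 to 25 reference range. The `toy_winners` values are made up for illustration.

```r
library(dplyr)

# Made-up heights (m) and weights (kg), using the units from Table 2.1.
toy_winners <- tibble(
  winner_name = c("Rider A", "Rider B", "Rider C"),
  height      = c(1.80, 1.75, 1.70),
  weight      = c(72, 60, 75)
)

winners_bmi <- toy_winners %>%
  mutate(
    bmi = weight / height^2,                   # kg divided by metres squared
    in_reference_range = between(bmi, 19, 25)  # range marked in Figure 3.3
  )
```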
Figure 3.4: The age of the previous winners of the Tour de France
Age is often a major factor in assessing an athlete’s potential performance. The plots are presented in Figure 3.4. The left panel displays the trend in the winners’ average age (averaged by decade). Although the average age has been highly volatile over the hundred years, after 1980 it gradually increased, reaching 29 in the past ten years. The right panel shows the age distribution of all winners. The winners are mainly between 20 and 35 years old, with an average of about 27. Compared with other competitions, the winners of the Tour de France are relatively old.
Figure 3.5: Number of Tour de France wins by country
Figure 3.6: Number of Tour de France wins by team
Comparing the number of winners by country (Figure 3.5), France far exceeds second-placed Belgium. Figure 3.6 compares the number of winners by team: the French team has four more winners than Alcyon-Dunlop, which is ranked second.
For both countries and teams, the top-ranked entry has far more winners than those that follow. One likely reason is that the Tour de France rewards strategy: a strong team or country tends to have a strategy better suited to the competition. In addition, the financial backing of the larger teams allows them to attract top-tier riders, coaches and staff through the pay packets they can offer, and to afford state-of-the-art practices that assist performance, potentially reinforcing their dominance over smaller teams (“What Makes Ineos Unbeatable?” 2019). This suggests that a rider’s chance of winning the Tour de France may depend on the team they have chosen.
The distance covered in the Tour de France through history
It is worthwhile for future Tour de France contenders to see how the distance has changed since the competition was first held in 1903, to better understand the battlefield. The previous report by Ang and Lyubic, which calculated the percentage of riders who failed to complete each stage, hinted that the greatest difficulty lay in the mountain and hilly stages. Figure 3.7 shows that 2.13% of all riders who entered a mountain stage failed to complete it, while 1.61% of all riders who entered a hilly stage failed to complete it.
Figure 3.7: The percentage of riders that have not finished the main four stage types, since 1969
We then used this information to categorize the distance by the percentage of the mountain and hilly stage. Based on the range of that percentage, we divided the distance into four types, namely
We also add a trend line to the plot using the loess method. Our findings are shown in the following figure.
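The categorisation step can be sketched as follows. This is only an illustration under stated assumptions: the stage type labels, the toy distances, and the four cut points (40, 50 and 60 percent) are hypothetical, chosen to be consistent with the ranges mentioned in the discussion of Figure 3.8, and are not taken from the original analysis code.

```r
library(dplyr)

# Toy stage table; type labels and distances (km) are made up.
toy_stages <- tibble(
  year     = rep(c(1930, 1975, 2010), each = 4),
  type     = rep(c("Flat", "Mountain", "Hilly", "Time trial"), times = 3),
  distance = c(2000, 1500, 500, 1000,   # 1930: 2000 of 5000 km climbing
               1500, 1500, 700, 300,    # 1975: 2200 of 4000 km climbing
               1800, 800, 400, 500)     # 2010: 1200 of 3500 km climbing
)

edition_terrain <- toy_stages %>%
  group_by(year) %>%
  summarise(
    total_distance = sum(distance),
    # Share of the route ridden on mountain or hilly stages, in percent.
    climbing_pct = 100 * sum(distance[type %in% c("Mountain", "Hilly")]) / total_distance,
    .groups = "drop"
  ) %>%
  mutate(
    # Hypothetical four-way split; intervals are closed on the left.
    terrain_class = cut(
      climbing_pct,
      breaks = c(0, 40, 50, 60, 100),
      labels = c("< 40%", "40-49.9%", "50-59.9%", ">= 60%"),
      right = FALSE,
      include.lowest = TRUE
    )
  )
```

With a table like `edition_terrain`, a loess trend line could be added to a ggplot2 scatter plot of distance against year with `geom_smooth(method = "loess")`.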
Figure 3.8: The distance of Tour de France Route from 1903 to 2019
Figure 3.8 shows that the distance covered in the Tour de France continued to increase until 1933. A likely reason is that the competition was then intended to boost the popularity of the sports newspaper L’Auto: long-distance cycling was sensational and could boost its sales (Wikipedia 2020). Not surprisingly, in that period the Tour de France distance could exceed 5000 km. After 1933, the competition’s distance gradually declined.
The marginal histogram on the right side of the plot also suggests that the distribution of Tour de France distances is multimodal. In the first three decades of the competition, the most frequent distance was around 5500 km. After that, the most frequent distances decreased to about 4500 km and 4000 km, and finally to around 3500 km.
Does this decreasing trend mean that the difficulty of the Tour de France was also decreasing? Figure 3.8 shows that from 1933 to the late 1960s, while the distance was falling, the route was still dominated by 40-49.9 percent mountains, just as before 1933. Because the terrain did not change much while the distance shortened, the difficulty might have been reduced.
From the early 1970s to 2006, as the distance decreased, the route was dominated by more challenging terrain, namely more than 50 percent mountains and hills. Hence, the shorter distance did not make the competition easier. The period from 2007 to 2016 might have been more relaxed, because the distance trend declined and mountains and hills made up less than 50 percent of the total distance. In 2018 and 2019, the share of mountain and hilly stages increased again to more than 60 percent of the total distance.
The speed of the winners: how are bicycle technology development and doping associated with performance?
Figure 3.9: Speed of the winners through the history (1903-2018)
In contrast to distance, cyclists’ speed has increased throughout the history of the Tour de France (see Figure 3.9). We tried to identify lurking variables behind this phenomenon. According to Petrocik (n.d.), one reason for the increase is the continued development of bicycle technology. We therefore divided the competition into periods based on the timeline of bicycle technology development given by Homfray (2013). We also marked the periods when doping became widespread and when testing was introduced, since according to Rivenaes (2020) doping also affected cyclists’ speed.
The density plot in Figure 3.9 shows that the riders’ speed has a bimodal distribution. In terms of bicycle technology, the lower peak could be associated with the period before revolutionary bicycle technology arrived. At that time, the average speed of a cyclist was only about 25 km/h. After the derailleur was introduced in 1938, speed increased, because cyclists could change gear to travel through the mountains. However, the introduction of lightweight bicycles in the early 1970s and disc wheels in 1985 does not seem to be associated with an increase in speed, which stayed roughly flat.
From 1990 onwards, bicycle technology continued to develop. In this period, doping was also used illegally by cyclists to increase their stamina. The highest speed in history occurred in 1998, right in this era of doping use. The fastest record was set by Lance Armstrong, who eventually admitted that he had used doping to win the competition. From 2000, when testing for doping was introduced, there was no noticeable change in speed, which stayed about the same as in the previous period. Only when blood passports were introduced in 2008 did riders’ speed fall, because doping was tested more closely.
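The speed and period variables discussed above can be sketched as follows. This assumes that speed is computed as distance divided by time_overall (km and hours, per Table 2.1); the break years come from the text, while the era labels and the `toy_winners` values are hypothetical.

```r
library(dplyr)

# Made-up winners with round numbers so the speeds are easy to check.
toy_winners <- tibble(
  year         = c(1920, 1950, 1995, 2015),
  distance     = c(5500, 4800, 3600, 3400),  # km
  time_overall = c(220, 150, 90, 85)         # hours
)

winners_speed <- toy_winners %>%
  mutate(
    speed_kmh = distance / time_overall,
    # Periods split at the years named in the text; labels are illustrative.
    era = cut(
      year,
      breaks = c(-Inf, 1938, 1970, 1985, 1990, 2000, 2008, Inf),
      labels = c("Before derailleur", "Derailleur", "Lightweight bicycles",
                 "Disc wheels", "Doping widespread", "Doping testing",
                 "Blood passport"),
      right = FALSE
    )
  )
```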
As more editions of the Tour de France have been held and technology has developed, equipment has become more advanced and team strategy more mature, which may have made the race more competitive. Below, we analyse whether the Tour de France has become more competitive based on the data.
Figure 3.10: Average hours taken by the winner to complete the race
As shown in Figure 3.10, the average total time taken by the winner to complete the race increased before 1920 and peaked around 1920 at over two hundred hours. After 1920, it began to decline, and in recent decades it has dropped to less than ninety hours. However, this result alone does not show that riders are getting faster or that the race is becoming more competitive, because the reduction in average total hours is also affected by the distance of the races.
Figure 3.11: Average difference in finishing time between the race winner and the runner up
Next, we look at the trend in time margin (Figure 3.11); the line represents the time margin averaged by decade. The time margin is clearly decreasing, which means that the gap between the winner and the runner-up is getting smaller. By 2010, the time margin was already less than three minutes. This means that, regardless of the distance, the performance difference between riders has become smaller and the competition more intense: a slight mistake may cost a rider the win.
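The decade averages behind Figures 3.10 and 3.11 can be sketched as below. The `toy_winners` values are made up, and the sketch assumes that time_margin is stored in hours like time_overall, so it is multiplied by 60 to express the margin in minutes.

```r
library(dplyr)

# Made-up winners; time_margin is assumed to be in hours.
toy_winners <- tibble(
  year         = c(1921, 1925, 1958, 1959, 2011, 2016),
  time_overall = c(220, 230, 120, 124, 86, 88),
  time_margin  = c(0.5, 0.7, 0.2, 0.3, 0.03, 0.05)
)

by_decade <- toy_winners %>%
  mutate(decade = floor(year / 10) * 10) %>%   # e.g. 1925 falls in 1920
  group_by(decade) %>%
  summarise(
    mean_hours      = mean(time_overall),
    mean_margin_min = mean(time_margin) * 60,  # hours to minutes
    .groups = "drop"
  )
```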
From the above analysis, we conclude that winners do not have to perform outstandingly in most stages; they need to conserve physical strength in some stages and allocate it sensibly to others. The winners’ common characteristics are a BMI within the reasonable range for males and an age of around 27, possibly because older riders are more experienced in the competition. Moreover, France has won most often, possibly because of its natural geographical conditions and excellent resources.
The analysis also shows that the Tour de France’s distance has decreased considerably compared with its initial period. Taking terrain difficulty into account, we conclude that a shorter distance does not necessarily make the Tour de France easier, because mountain and hilly stages dominated from the early 1970s to 2006.
In contrast with distance, cyclists have become faster over time. We found that the introduction of the derailleur was associated with an increase in cyclists’ speed. After 1990, speed continued to increase, which may be associated with drug use by the athletes. Speed seemed to decrease slightly when doping tests were carried out more strictly using the blood passport.
As the time difference between riders narrows, the competition is becoming more intense, and the strategy at each stage is particularly important.
A limitation of this research is that the data set contains many variables with a relatively low proportion of missing values, so there are many analysable questions and it is challenging to cover all possible analyses in one report. In addition, we could not examine all variables as potential lurking variables, so we relied only on hints from the previous report and on some reference articles.
Thanks to TidyTuesday for providing the data cleaning process, relevant articles and the original data source on GitHub (Hughes (2020)). The data set used is the Tour de France data set offered by Alastair Rushworth in the tdf package (Rushworth (2020)). Thanks to Samuel Lyubic and Brendi Ang for providing an excellent report as our reference.
Packages used are Wickham et al. (2019a), Wickham et al. (2019b), Sievert (2020), Zhu (2019), Grolemund and Wickham (2011), Tierney et al. (2020), Attali and Baker (2019), Pedersen (2020a), Pedersen (2020b), Perepolkin (2019), Wickham (2020), Slowikowski (2020) and Pedersen and Robinson (2020).
Attali, Dean, and Christopher Baker. 2019. GgExtra: Add Marginal Histograms to ’Ggplot2’, and More ’Ggplot2’ Enhancements. https://CRAN.R-project.org/package=ggExtra.
EEB. 2020. “Tour de France.” 2020. https://www.britannica.com/sports/Tour-de-France.
Encyclopedia Britannica, Inc. 2020. “Tour de France Cycling.” 2020. https://www.britannica.com/sports/Tour-de-France.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. http://www.jstatsoft.org/v40/i03/.
Homfray, Reece. 2013. “How Tour de France Cycle Race Evolved over 100 Years Ago.” https://www.adelaidenow.com.au/sport/how-tour-de-france-cycle-race-evolved-over-100-years-ago/news-story/f759f06f3ae23d4faf4fcba089e37cd6?sv=5bfd117ddc9268e7db9f3cf2b5d0fa1.
Hughes, Ellis. 2020. TidytuesdayR: Access the Weekly ’Tidytuesday’ Project Dataset. https://github.com/thebioengineer/tidytuesdayR.
Mock, Thomas. 2020. “Tidy Tuesday Tour de France Dataset.” https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-04-07/readme.md.
Pedersen, Thomas Lin. 2020a. Ggforce: Accelerating ’Ggplot2’. https://CRAN.R-project.org/package=ggforce.
———. 2020b. Patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.
Pedersen, Thomas Lin, and David Robinson. 2020. Gganimate: A Grammar of Animated Graphics. https://CRAN.R-project.org/package=gganimate.
Perepolkin, Dmytro. 2019. Polite: Be Nice on the Web. https://CRAN.R-project.org/package=polite.
Petrocik, John R. n.d. “Racing at Increasing Speed.” https://www.bikeraceinfo.com/tech/bike-racing-speeds-increase.html.
Rivenaes, Andre Waage. 2020. “Analyzing Tour de France-Data.” https://www.andrewaage.com/post/analyzing-tour-de-france-data/.
Rushworth, Alastair. 2020. Tdf: Tour de France Data. https://alastairrushworth.github.io/tdf/.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.
Slowikowski, Kamil. 2020. Ggrepel: Automatically Position Non-Overlapping Text Labels with ’Ggplot2’. https://CRAN.R-project.org/package=ggrepel.
Tierney, Nicholas, Di Cook, Miles McBain, and Colin Fay. 2020. Naniar: Data Structures, Summaries, and Visualisations for Missing Data. https://CRAN.R-project.org/package=naniar.
“What Makes Ineos Unbeatable?” 2019. 2019. https://www.bicycling.com/racing/a28637007/ineos-tour-de-france/.
Wickham, Hadley. 2020. Rvest: Easily Harvest (Scrape) Web Pages. https://CRAN.R-project.org/package=rvest.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019a. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
———. 2019b. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wikipedia. 2020. “Tour de France.” https://en.wikipedia.org/wiki/Tour_de_France.
Wikipedia contributors. 2020. “2018 Tour de France — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=2018_Tour_de_France&oldid=960595512.
Wikipedia contributors. 2020. “2019 Tour de France — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=2019_Tour_de_France&oldid=968463007.
Zhu, Hao. 2019. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.